fix: add function DropItemInOneWriteQueue to do the accurate queue clear when slave and master disconnect due to timeout #2666

Merged

Collaborator

@cheniujh cheniujh commented May 21, 2024

This PR fixes Issue #2665

Causes of Issue #2665:

  • 1 When a master and a slave disconnect because of a timeout, the master should clear only the Binlog WriteQueue of the DB that timed out. Instead, it clears every WriteQueue associated with that slave node, including queues still used by DBs in the Connected state that are in the middle of transmitting Binlog. Example: DB0 times out while DB1's master-slave connection is healthy and performing incremental synchronization. The timeout handling for DB0 also clears the WriteQueue used by DB1. The correct behavior is to clear only the WriteQueue that belongs to DB0.
  • 2 Similarly, when a slave has just established an incremental synchronization connection with the master, it sends a special "first-binlog-ack" to tell the master where to resume Binlog transmission. The master should reset/clear only the WriteQueue of the DB that sent the "first-binlog-ack", but it actually resets/clears all WriteQueues of the slave that sent it. Example: DB0 establishes incremental synchronization first and DB1 follows shortly after. The "first-binlog-ack" from DB1 also clears DB0's WriteQueue, which very likely still contains pending items.
  • 3 Because of 1 and 2, the master unintentionally drops binlog items from unrelated WriteQueues, so it never sends a batch of binlogs that it should have sent (the missing binlogs are exactly the items that were dropped from those queues). This is also why the master receives a BinlogAck whose AckEnd is smaller than AckStart: the slave's latest binlog offset is far behind the offset the master expects.

How this PR fixes the issue:

  • It adds a function "DropItemInOneWriteQueue" and uses it in place of "DropItemInWriteQueue" in the scenarios above, so that the master no longer clears unrelated WriteQueues.
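
The difference between the two drop functions can be illustrated with a small, self-contained sketch. It is not the code from this PR: the `WriteQueueRegistry` class, the `WriteTask` struct, the `Pending` helper, the function signatures, and the nested map layout (slave `ip:port` to DB name to queue) are all assumptions made for illustration. Only the two function names come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for one queued binlog item.
struct WriteTask {
  std::string db_name;
  int binlog_offset;
};

// Hypothetical registry: slave "ip:port" -> DB name -> pending binlog items.
class WriteQueueRegistry {
 public:
  void Push(const std::string& ip, int port, const WriteTask& task) {
    write_queues_[SlaveKey(ip, port)][task.db_name].push(task);
  }

  // Slave-wide drop: removes the queues of every DB on that slave.
  // Calling this for a single-DB timeout is the bug described above.
  void DropItemInWriteQueue(const std::string& ip, int port) {
    write_queues_.erase(SlaveKey(ip, port));
  }

  // DB-scoped drop: removes only the queue of db_name on that slave.
  void DropItemInOneWriteQueue(const std::string& ip, int port,
                               const std::string& db_name) {
    auto slave_it = write_queues_.find(SlaveKey(ip, port));
    if (slave_it == write_queues_.end()) return;
    slave_it->second.erase(db_name);
  }

  // Number of items still queued for one DB of one slave.
  std::size_t Pending(const std::string& ip, int port,
                      const std::string& db_name) const {
    auto slave_it = write_queues_.find(SlaveKey(ip, port));
    if (slave_it == write_queues_.end()) return 0;
    auto db_it = slave_it->second.find(db_name);
    return db_it == slave_it->second.end() ? 0 : db_it->second.size();
  }

 private:
  static std::string SlaveKey(const std::string& ip, int port) {
    return ip + ":" + std::to_string(port);
  }

  std::unordered_map<std::string,
                     std::unordered_map<std::string, std::queue<WriteTask>>>
      write_queues_;
};

// Replays the "DB0 times out, DB1 is still syncing" example with both functions.
inline void RunTimeoutExample() {
  const std::string ip = "10.0.0.2";  // made-up slave address
  const int port = 9221;

  WriteQueueRegistry slave_wide;
  slave_wide.Push(ip, port, {"db0", 100});
  slave_wide.Push(ip, port, {"db1", 200});
  slave_wide.DropItemInWriteQueue(ip, port);  // DB0 timeout, slave-wide drop
  assert(slave_wide.Pending(ip, port, "db1") == 0);  // DB1's binlog is lost

  WriteQueueRegistry scoped;
  scoped.Push(ip, port, {"db0", 100});
  scoped.Push(ip, port, {"db1", 200});
  scoped.DropItemInOneWriteQueue(ip, port, "db0");  // DB0 timeout, scoped drop
  assert(scoped.Pending(ip, port, "db0") == 0);
  assert(scoped.Pending(ip, port, "db1") == 1);  // DB1's binlog survives
}
```

In this sketch the slave-wide drop discards DB1's pending item even though only DB0 timed out, which mirrors how the master ends up skipping binlogs. The scoped drop leaves DB1's queue untouched.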

…the Master clean un-relevant WriteQueue when one DB timeout)
@github-actions github-actions bot added ☢️ Bug Something isn't working ✏️ Feature New feature or request labels May 21, 2024
@AlexStocks AlexStocks merged commit 6cd3e64 into OpenAtomFoundation:unstable May 24, 2024
25 checks passed
chenbt-hz pushed a commit to chenbt-hz/pika that referenced this pull request Jun 3, 2024
…the Master clean un-relevant WriteQueue when one DB timeout) (OpenAtomFoundation#2666)

Co-authored-by: cjh <[email protected]>
@cheniujh cheniujh deleted the AckEnd_smaller_than_AckStart branch June 24, 2024 03:21
@cheniujh cheniujh restored the AckEnd_smaller_than_AckStart branch June 25, 2024 11:59
chejinge pushed a commit that referenced this pull request Jul 31, 2024
…the Master clean un-relevant WriteQueue when one DB timeout) (#2666)

Co-authored-by: cjh <[email protected]>
cheniujh added a commit to cheniujh/pika that referenced this pull request Sep 24, 2024
…the Master clean un-relevant WriteQueue when one DB timeout) (OpenAtomFoundation#2666)

Co-authored-by: cjh <[email protected]>
Labels
3.5.5 4.0.0 ☢️ Bug Something isn't working
5 participants